
LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

arXiv.org e-Print Archive

Asterias: a parallelized web-based suite for the analysis of expression and aCGH data

Author: Alibes Andreu
Canada Andres
Casado David
Diaz-Uriarte Ramon
Morrissey Edward R.
Rueda Oscar M.
Yankilevich Patricio
Publication date: 22/10/2006

Asterias (http://www.asterias.info) is an integrated collection of freely accessible web tools for the analysis of gene expression and aCGH data. Most of the tools use parallel computing (via MPI). Most of our applications allow the user to obtain additional information for user-selected genes through clickable links in tables and/or figures. Our tools include: normalization of expression and aCGH data; converting between different types of gene/clone and protein identifiers; filtering and imputation; finding differentially expressed genes related to patient class and survival data; searching for models of class prediction; using random forests to search for minimal models for class prediction or for large subsets of genes with predictive capacity; searching for molecular signatures and predictive genes with survival data; detecting regions of genomic DNA gain or loss. The capability to send results between different applications, access to additional functional information, and parallelized computation make our suite unique and exploit features only available to web-based applications.

Comment: web-based application; 3 figures

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest

Author: A Alibés
A Liaw
B Efron
C Ambroise
C Strobl
EJ Kontoghiorghes
H Sutter
I Foster
I Medina
J Dongarra
KH Pan
L Ein-Dor
NL Pochet
P Pacheco
R Development Core Team
R Diaz-Uriarte
R Díaz-Uriarte
R Díaz-Uriarte
R Simon
Ramón Diaz-Uriarte
RL Somorjai
S Dudoit
S Dudoit
S Michiels
S Patel
S Varma
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor, including carefully avoiding selection biases and allowing analysis of multiple solutions, together with access to additional functional information on selected genes. Methodologically, such a tool would be of greater use if it incorporated state-of-the-art computational approaches and made source code available. Results: We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, which makes it possible to take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS for examination of PubMed references, GO terms, and KEGG and Reactome pathways characteristic of sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from http://genesrf2.bioinfo.cnio.es. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN. Conclusion: varSelRF and GeneSrF implement a validated method for gene selection including bootstrap estimates of classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology used (a combination of parallelization with a web-based application) they are also of methodological interest to bioinformaticians and biostatisticians.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions

Author: Lockwood
McCarroll
O. M. Rueda
Picard
R. Diaz-Uriarte
Rueda
Sebat
Shah
Yu
Publication venue: Oxford University Press

Summary: Several methods have been proposed to detect copy number changes and recurrent regions of copy number variation from aCGH, but few methods explicitly return probabilities of alteration, which are the direct answer to the question ‘is this probe/region altered?’ RJaCGH fits a non-homogeneous hidden Markov model to the aCGH data using Markov chain Monte Carlo with reversible jump, and returns the probability that each probe is gained or lost. Using these probabilities, recurrent regions (over sets of individuals) of copy number alteration can be found.
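To make "the probability that each probe is gained or lost" concrete, here is a minimal Python sketch. It is not RJaCGH: it swaps in a plain homogeneous three-state hidden Markov model with fixed, made-up parameters and the standard forward-backward algorithm, rather than a non-homogeneous model fitted by reversible-jump MCMC. The function names, state means, `sigma`, and `stay` probability are all illustrative assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def state_posteriors(obs, means, sigma, stay=0.9):
    """Per-probe posterior state probabilities via scaled forward-backward.

    Assumes a homogeneous HMM with known parameters (hypothetical values),
    Gaussian emissions with a shared sigma, and a uniform initial distribution.
    """
    k = len(means)
    trans = [[stay if i == j else (1 - stay) / (k - 1) for j in range(k)]
             for i in range(k)]
    emit = [[gaussian_pdf(x, m, sigma) for m in means] for x in obs]

    # Forward pass, normalizing at each step to avoid underflow.
    alpha = []
    pred = [1.0 / k] * k  # state distribution before seeing the next probe
    for e in emit:
        a = [pred[j] * e[j] for j in range(k)]
        s = sum(a)
        a = [v / s for v in a]
        alpha.append(a)
        pred = [sum(a[i] * trans[i][j] for i in range(k)) for j in range(k)]

    # Backward pass, also normalized at each step.
    beta = [[1.0] * k for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        b = [sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(k))
             for i in range(k)]
        s = sum(b)
        beta[t] = [v / s for v in b]

    # Combine and renormalize to get P(state | all observations) per probe.
    post = []
    for a, b in zip(alpha, beta):
        g = [x * y for x, y in zip(a, b)]
        s = sum(g)
        post.append([v / s for v in g])
    return post

# Toy log-ratios: a short gained segment in the middle (invented data).
log_ratios = [0.0, 0.1, -0.1, 0.9, 1.0, 1.1, 0.0]
posteriors = state_posteriors(log_ratios, means=[-1.0, 0.0, 1.0], sigma=0.3)
for x, (p_loss, p_normal, p_gain) in zip(log_ratios, posteriors):
    print(f"{x:5.1f}  loss={p_loss:.3f}  normal={p_normal:.3f}  gain={p_gain:.3f}")
```

Each row of the result sums to one, so the gain and loss columns can be read directly as per-probe alteration probabilities under the assumed model.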

Knowledge-based gene expression classification via matrix factorization

Author: A. M. Tomé
Affymetrix
Allison
Baldi
Barnhill
Bolstad
Breiman
Cardoso
Cardoso
Chen
D. Lutter
Diaz-Uriarte
Diaz-Uriarte
Dougherty
Dougherty
Dudoit
E. W. Lang
F. J. Theis
G. Schmitz
Galton
Galton
Golub
Guyon
Hochreiter
Irrizarry
Lee
Li
Liebermeister
Liu
Lutter
M. Stetter
Mangasarian
P. Gómez Vilda
P. Knollmüller
Pearson
Quackenbush
R. Schachtner
Saidi
Schachtner
Schachtner
Schölkopf
Simon
Spang
Talloen
Troyanskaya
Tusher
Wu
Wu
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2008

Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently being explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes, or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks. Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy versus Niemann Pick C disease patients.

Supported by: Siemens AG, Munich; DFG (Graduate College 638); DAAD (PPP Luso - Alemã and PPP Hispano - Alemanas)

University of Regensburg Publication Server

Repositório Institucional da Universidade de Aveiro

PuSH

Pomelo II: finding differentially expressed genes

Author: Alibes
Argraves
Barton
Baxter
E. R. Morrissey
Hokamp
Hyatt
Kapushesky
Knudsen
Luscombe
Montaner
Oliva
Patel
Potter
Psarros
R. Diaz-Uriarte
Rainer
Reiner
Romualdi
Weniger
Zhu
Publication venue: Oxford University Press
Publication date: 01/01/2009

Pomelo II (http://pomelo2.bioinfo.cnio.es) is an open-source, web-based, freely available tool for the analysis of gene (and protein) expression and tissue array data. Pomelo II implements: permutation-based tests for class comparisons (t-test, ANOVA) and regression; survival analysis using the Cox model; contingency table analysis with Fisher's exact test; linear models (of which the t-test and ANOVA are special cases) that allow additional covariates for complex experimental designs and use empirical Bayes moderated statistics. Permutation-based and Cox model analyses use parallel computing, which permits taking advantage of multicore CPUs and computing clusters. Access to, and further analysis of, additional biological information and annotations (PubMed references, Gene Ontology terms, KEGG and Reactome pathways) are available either for individual genes (from clickable links in tables and figures) or for sets of genes. The source code is available, allowing the software to be extended and reused. A comprehensive test suite is also available, and covers both the user interface and the numerical results. The possibility of including additional covariates, parallelization of computation, open-source availability of the code, and a comprehensive test suite make Pomelo II a unique tool.
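The idea behind a permutation-based class comparison can be sketched in a few lines of Python. This is only an illustration for a single gene, not Pomelo II's code: it uses the absolute difference in group means as the test statistic (a simplification of the t statistic), and the function name, `n_perm`, and `seed` are hypothetical.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Class labels are shuffled n_perm times; the p-value is the share of
    shuffles whose statistic is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of samples to classes
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Invented expression values for one gene in two patient classes.
class_1 = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0]
class_2 = [5.0, 5.1, 4.9, 5.2, 4.8, 5.0]
print(permutation_test(class_1, class_2))
```

In a real analysis this loop runs once per gene, and the permutations are independent of one another, which is why such tests parallelize well across cores or cluster nodes.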

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

SignS: a parallelized, open-source, freely available, web-based tool for gene selection and molecular signatures for survival and censored data

Author: A Alibés
C Ambroise
C Hughes
D Turek
EJ Kontoghiorghes
F Harrell
H Li
H Li
H Sutter
HMM Bøvelstad
Hothorn
I Foster
J Dongarra
J Gui
J Gui
J Klein
J Waldo
K Asanovic
KF Fogel
KH Pan
L Kaderali
M Reich
M Schumacher
MR Segal
N Sha
P Bühlmann
P Graham
P Pacheco
P Van Roy
PJ Park
R Bair
R Development Core Team
R Diaz-Uriarte
R Díaz-Uriarte
R Díaz-Uriarte
R Simon
Ramon Diaz-Uriarte
RL Somorjai
S Dudoit
S Ma
S Ma
S Ma
S Varma
SM Baxter
SS Dave
T Hothorn
T Hothorn
T Hothorn
WN van Wieringen
Y Pawitan
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Censored data are increasingly common in many microarray studies that attempt to relate gene expression to patient survival. Several new methods have been proposed in the last two years. Most of these methods, however, are not available to biomedical researchers, leading to many re-implementations from scratch of ad hoc, and suboptimal, approaches with survival data. Results: We have developed SignS (Signatures for Survival data), an open-source, freely available, web-based tool and R package for gene selection, building molecular signatures, and prediction with survival data. SignS implements four methods which, according to existing reviews, perform well and, by being of a very different nature, offer complementary approaches. We use parallel computing via MPI, leading to large decreases in user waiting time. Cross-validation is used to assess predictive performance and stability of solutions, the latter an issue of increasing concern given that there are often several solutions with similar predictive performance. Biological interpretation of results is enhanced because genes and signatures in models can be sent to other freely available online tools for examination of PubMed references, GO terms, and KEGG and Reactome pathways of selected genes. Conclusion: SignS is the first web-based tool for survival analysis of expression data, and one of the very few with biomedical researchers as target users. SignS is also one of the few bioinformatics web-based applications to extensively use parallelization, including fault tolerance and crash recovery. Because of its combination of methods implemented, usage of parallel computing, code availability, and links to additional databases, SignS is a unique tool, and will be of immediate relevance to biomedical researchers, biostatisticians and bioinformaticians.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Conditional variable importance for random forests

Author: A Bureau
Achim Zeileis
Anne-Laure Boulesteix
BJ van Os
C Strobl
C Strobl
C Strobl
Carolin Strobl
E Bauer
JH Silber
K Nicodemus
KJ Archer
KL Lunetta
L Breiman
L Breiman
L Breiman
L Breiman
L Breiman
M Nason
MR Segal
M van der Laan
P Bühlmann
P Good
R Development Core Team
R Diaz-Uriarte
R Diaz-Uriarte
R Feraud
SM Stigler
T Hastie
T Hothorn
TG Dietterich
Thomas Augustin
Thomas Kneib
V Svetnik
W Rodenburg
X Huang
X Xia
Y Lin
Y Qi
Publication venue: BioMed Central
Publication date: 01/01/2008

Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach.
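The difference between the unconditional (marginal) and the conditional permutation scheme can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' implementation: instead of a random forest it uses a fixed hypothetical predictor that relies on a proxy variable, and the strata are simply the values of the correlated covariate rather than a partition derived from tree cutpoints. All names and data are invented.

```python
import random

def mse(predict, rows, y):
    """Mean squared error of predict over the rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permuted_column(rows, j, rng, strata=None):
    """Copy rows with column j shuffled, optionally only within strata."""
    if strata is None:
        strata = [0] * len(rows)  # a single stratum: marginal permutation
    groups = {}
    for i, s in enumerate(strata):
        groups.setdefault(s, []).append(i)
    new_rows = [list(r) for r in rows]
    for idx in groups.values():
        values = [rows[i][j] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            new_rows[i][j] = v
    return new_rows

def permutation_importance(predict, rows, y, j, strata=None, n_rep=50, seed=0):
    """Average increase in MSE after permuting column j."""
    rng = random.Random(seed)
    base = mse(predict, rows, y)
    total = 0.0
    for _ in range(n_rep):
        total += mse(predict, permuted_column(rows, j, rng, strata), y) - base
    return total / n_rep

# Toy data: column 0 (z) drives y; column 1 is a noisy copy of z.
rows = [[i % 2, (i % 2) + ((i % 5) - 2) / 10] for i in range(20)]
y = [r[0] for r in rows]

def proxy_model(row):
    """Hypothetical fitted model that uses only the correlated proxy."""
    return 1 if row[1] > 0.5 else 0

marginal = permutation_importance(proxy_model, rows, y, j=1)
conditional = permutation_importance(proxy_model, rows, y, j=1,
                                     strata=[r[0] for r in rows])
print("marginal:", marginal, "conditional:", conditional)
```

Shuffling the proxy across all samples breaks its association with both z and y, so it looks important; shuffling it only among samples that share the same z leaves the predictions unchanged here, which reflects that the proxy adds nothing once z is known.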

CiteSeerX

Elektronische Publikationen der Wirtschaftsuniversität Wien

Open Access LMU

Large-scale risk prediction applied to Genetic Analysis Workshop 17 mini-exome sequence data

Author: B Efron
BA Goldstein
BE Madsen
C Robert
Gengxin Li
H Zhong
Hongyu Zhao
Jia Kang
John Ferguson
Joon Sang Lee
L Almasy
L Breiman
Lun Li
R Diaz-Uriarte
R Tibshirani
T Hastie
Wei Zheng
Xianghua Zhang
Xiting Yan
Publication venue: BioMed Central
Publication date: 01/01/2011

We consider the application of Efron’s empirical Bayes classification method to risk prediction in a genome-wide association study using the Genetic Analysis Workshop 17 (GAW17) data. A major advantage of using this method is that the effect size distribution for the set of possible features is empirically estimated and that all subsequent parameter estimation and risk prediction is guided by this distribution. Here, we generalize Efron’s method to allow for some of the peculiarities of the GAW17 data. In particular, we introduce two ways to extend Efron’s model: a weighted empirical Bayes model and a joint covariance model that allows the model to properly incorporate the annotation information of single-nucleotide polymorphisms (SNPs). In the course of our analysis, we examine several aspects of the possible simulation model, including the identity of the most important genes, the differing effects of synonymous and nonsynonymous SNPs, and the relative roles of covariates and genes in conferring disease risk. Finally, we compare the three methods to each other and to other classifiers (random forest and neural network).

CNVassoc: Association analysis of CNV data using R

Author: A Caceres
B Servin
BN Howie
C Barnes
C Le Marechal
E Gonzalez
Gavin Lucas
Isaac Subirana
J Du
J Hellemans
J Marchini
JM Korn
JP Schouten
JR Gonzalez
Juan R Gonzalez
MA van de Wiel
R Development Core Team
Ramon Diaz-Uriarte
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Copy number variants (CNV) are a potentially important component of the genetic contribution to risk of common complex diseases. Analysis of the association between CNVs and disease requires that uncertainty in CNV copy-number calls, which can be substantial, be taken into account; failure to consider this uncertainty can lead to biased results. Therefore, there is a need to develop and use appropriate statistical tools. To address this issue, we have developed CNVassoc, an R package for carrying out association analysis of common copy number variants in population-based studies. This package includes functions for testing for association with different classes of response variables (e.g. class status, censored data, counts) under a series of study designs (case-control, cohort, etc.) and inheritance models, adjusting for covariates. The package includes functions for inferring copy number (CNV genotype calling), but can also accept copy number data generated by other algorithms (e.g. CANARY, CGHcall, IMPUTE). Results: Here we present a new R package, CNVassoc, that can deal with different types of CNV arising from different platforms such as MLPA or aCGH. Through a real data example we illustrate that our method is able to incorporate uncertainty in the association process. We also show how our package can be useful when analyzing imputed SNPs. Through a simulation study we show that CNVassoc outperforms CNVtools in terms of computing time as well as in convergence failure rate. Conclusions: We provide a package that outperforms the existing ones in terms of modelling flexibility, power, convergence rate, ease of covariate adjustment, and requirements for sample size and signal quality. Therefore, we offer CNVassoc as a method for routine use in CNV association studies.

This work has been supported by the Spanish Ministry of Science and Innovation (MTM2008-02457 to JRG, BIO2009-12458 to RD-U and statistical genetics network MTM2010-09526-E (subprograma MTM) to JRG, IS, GL and RD-U). GL is supported by the Juan de la Cierva Program of the Spanish Ministry of Science and Innovation.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas